18.1 Transcriptomics
tissues according to their gene expression profiles; it might be inferred that tissues
with the same or similar expression profile belong to the same clinical state.
If a set of experiments comprising samples prepared from cells grown under $m$
different conditions has been carried out, then the set of normalized intensities (i.e.,
transcript abundances) for each experiment defines a point in $m$-dimensional expression
space, whose coordinates give the (normalized) degrees of expression. Distances
between the points can be calculated by, for example, the Euclidean distance metric,
that is,
$$d = \left[\sum_{i=1}^{m} (a_i - b_i)^2\right]^{1/2}, \tag{18.1}$$
for two samples $a$ and $b$ subjected to $m$ different conditions. Clustering algorithms
(Sect. 13.2.1) can then be used to group transcripts on the basis of their similarities.
The hierarchical clustering procedure is the same as that used to construct phylogenies
(Sect. 17.7); that is, the closest pair of transcripts forms the first cluster, the transcript
with the closest mean distance to the first cluster forms the second cluster, and so
on. This is the unweighted pair-group method average (UPGMA); variants include
single-linkage clustering, in which the distance between two clusters is calculated as
the minimum distance between any members of the two clusters, and so on.
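
As a minimal sketch of Eq. (18.1) and of the agglomerative procedure described above, the following Python uses only the standard library. The profile values, the function names (`euclidean`, `cluster_distance`, `agglomerate`) and the `linkage` argument are illustrative assumptions, not taken from any particular package. The loop is the naive form that rescans every pair of clusters at each step and merges the closest pair.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance (Eq. 18.1) between two equal-length profiles."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cluster_distance(profiles, c1, c2, linkage="average"):
    """Distance between two clusters, each a tuple of indices into `profiles`.

    "average" is the UPGMA rule (mean over all cross-cluster pairs);
    "single" takes the minimum over all cross-cluster pairs.
    """
    dists = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    return min(dists) if linkage == "single" else sum(dists) / len(dists)


def agglomerate(profiles, linkage="average"):
    """Repeatedly merge the closest pair of clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(profiles, pair[0], pair[1], linkage),
        )
        merges.append((c1, c2, cluster_distance(profiles, c1, c2, linkage)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical normalized expression profiles (rows) over m = 3 conditions.
profiles = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 4.0],
    [8.0, 8.0, 8.0],
    [10.0, 8.0, 8.0],
]

merges = agglomerate(profiles, linkage="average")
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Passing `linkage="single"` changes only how the distance between two clusters is computed; the merge order for these toy data stays the same, but the final merge occurs at a smaller distance.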
Fuzzy clustering algorithms may be more successful than the above “hard”
schemes for large and complex datasets. Fuzzy schemes allow points to belong to
more than one cluster. Degree of membership is defined by
$$u_{r,s} = \frac{1}{\displaystyle\sum_{j=1}^{m} \left( \frac{d(x_r, \theta_s)}{d(x_r, \theta_j)} \right)^{1/(q-1)}}, \qquad r = 1, \ldots, N;\; s = 1, \ldots, m, \tag{18.2}$$
for $N$ points and $m$ clusters ($m$ is given at the start of the algorithm), where $d(x_i, \theta_j)$
is the distance between the point $x_i$ and the cluster represented by $\theta_j$, and $q > 1$ is
the fuzzifying parameter. The cost function
$$\sum_{i=1}^{N} \sum_{j=1}^{m} u_{i,j}^{q}\, d(x_i, \theta_j) \tag{18.3}$$
is minimized (subject to the condition that, for each point $x_i$, the $u_{i,j}$ sum to unity
over the clusters), and clustering converges to cluster centres corresponding to local
minima or saddle points of the cost function. The procedure is typically repeated for
an increasing number of clusters until some criterion for clustering quality becomes
stable; for example, the partition coefficient
$$\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} u_{i,j}^{2}. \tag{18.4}$$
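
The quantities in Eqs. (18.2)–(18.4) can be sketched in a few lines of standard-library Python. Several choices here are assumptions rather than anything stated in the text: the distance $d$ is taken to be the squared Euclidean distance, the cluster representatives $\theta_j$ are updated as membership-weighted means (the usual choice for that distance), the initial centres are supplied by the caller, and a fixed number of iterations stands in for a convergence test. All function names and data values are hypothetical.

```python
def sq_dist(x, theta):
    """Assumed distance d(x, theta): squared Euclidean distance."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, theta))


def memberships(points, centres, q=2.0, eps=1e-12):
    """Eq. 18.2: u[r][s] is the membership of point r in cluster s."""
    u = []
    for x in points:
        # eps guards against division by zero when a point sits on a centre.
        d = [max(sq_dist(x, th), eps) for th in centres]
        u.append(
            [1.0 / sum((ds / dj) ** (1.0 / (q - 1.0)) for dj in d) for ds in d]
        )
    return u


def cost(points, centres, u, q=2.0):
    """Eq. 18.3: sum over points and clusters of u^q times the distance."""
    return sum(
        u[i][j] ** q * sq_dist(x, th)
        for i, x in enumerate(points)
        for j, th in enumerate(centres)
    )


def partition_coefficient(u):
    """Eq. 18.4: mean over points of the summed squared memberships."""
    return sum(uij ** 2 for row in u for uij in row) / len(u)


def update_centres(points, u, q=2.0):
    """Assumed update: each centre is the u^q-weighted mean of the points."""
    dim = len(points[0])
    centres = []
    for j in range(len(u[0])):
        w = [row[j] ** q for row in u]
        total = sum(w)
        centres.append(
            [sum(wi * x[k] for wi, x in zip(w, points)) / total for k in range(dim)]
        )
    return centres


def fuzzy_c_means(points, centres, q=2.0, n_iter=50):
    """Alternate membership and centre updates for a fixed number of steps."""
    for _ in range(n_iter):
        u = memberships(points, centres, q)
        centres = update_centres(points, u, q)
    return centres, memberships(points, centres, q)


# Hypothetical two-dimensional points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
start = [[1.0, 1.0], [9.0, 9.0]]

u0 = memberships(points, start)
centres, u = fuzzy_c_means(points, start)
print("cost before:", cost(points, start, u0))
print("cost after: ", cost(points, centres, u))
print("partition coefficient:", partition_coefficient(u))
```

The partition coefficient lies between $1/m$ (every point shared equally among all clusters) and 1 (every point assigned wholly to one cluster), so repeating the run for increasing $m$ and recording this value gives one way of applying the stability criterion mentioned above.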